If you get an error, go to the Packages panel in RStudio, click Install, and type the name of the library (you only need to do this once).
library(MLTools)
library(fpp2)
library(readxl)
library(tidyverse)
library(lmtest) # contains coeftest function
library(tseries) # contains adf.test function
library(TSstudio)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
fdata <- read_excel("ARIMA_example.xlsx")
Visualize the first rows:
head(fdata)
## # A tibble: 6 × 1
## y
## <dbl>
## 1 7.07
## 2 7.01
## 3 7.06
## 4 7.15
## 5 7.21
## 6 7.15
Assume that there are no time gaps, and use the natural numbers as index.
fdata_ts <- ts(fdata)
head(fdata_ts)
## Time Series:
## Start = 1
## End = 6
## Frequency = 1
## y
## [1,] 7.070480
## [2,] 7.014591
## [3,] 7.063700
## [4,] 7.149487
## [5,] 7.209597
## [6,] 7.147241
We get some basic information about the series:
ts_info(fdata_ts)
## The fdata_ts series is a ts object with 1 variable and 1000 observations
## Frequency: 1
## Start time: 1 1
## End time: 1000 1
We use the column index to select the time series:
y <- fdata_ts[ , 1]
We use the following plots to assess the stationarity of the time series. With ggtsdisplay we get a time plot and both the ACF and PACF.
ggtsdisplay(y)
Next we decide whether to apply a Box-Cox transformation:
Lambda <- BoxCox.lambda.plot(y, window.width = 10)
## `geom_smooth()` using formula = 'y ~ x'
# Lambda <- BoxCox.lambda(y) #other option
z will be the transformed time series if Box-Cox is applied. Note: if you decide that you do not want to use Box-Cox, set useLambda <- FALSE in the next code line:
useLambda <- TRUE
if(useLambda) {
z <- BoxCox(y, Lambda)
} else {
z <- y
Lambda <- 1
}
We can plot together the original and transformed time series to see the result of the transformation.
autoplot(y, linewidth=0.25, color = "red") +
autolayer(z, color = "blue", alpha=0.5)
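As a sanity check, the transformation can be undone with InvBoxCox from the forecast package (a minimal sketch; the recovered series should match y up to rounding):

```r
# Undo the Box-Cox transformation; y_back should equal y
# up to floating-point error when useLambda is TRUE.
y_back <- InvBoxCox(z, Lambda)
max(abs(y_back - y))
```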
When the ACF decreases very slowly, the time series needs differencing. We analyze the ACF of the (possibly Box-Cox transformed) time series.
ggtsdisplay(z, lag.max = 25)
There are two alternative ways of checking for stationarity. The first is the augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series is non-stationary (see (Hyndman and Athanasopoulos 2021), Unit Root Tests, and the video therein):
adf.test(z, alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: z
## Dickey-Fuller = -3.3705, Lag order = 9, p-value = 0.05852
## alternative hypothesis: stationary
Check the size of the p-value. The fpp3 book also discusses the KPSS test, which we can apply by uncommenting the next code line.
# tseries::kpss.test(z, null = "Trend")
A second way: the ndiffs function uses the previous
tests to suggest a differencing order that will make the time series
stationary:
ndiffs(z)
## [1] 1
Either way, now you need to decide the number d of
differences to apply, if any:
d <- 1
If differencing is needed we apply it here:
Bz <- if(d > 0){
diff(z, differences = d) # apply the d differences chosen above
} else {
z
}
We use the ACF and PACF of the Bz time series to identify the significant lags and propose the ARMA structure of the model:
ggtsdisplay(Bz, lag.max = 25)
Now we propose a structure of the model by selecting the values of (p, d, q) where d is for the number of differences (already decided).
p <- 2
q <- 0
This is the chosen (p, d, q) structure:
cat(paste0(c("p", "d", "q")," = ", c(p, d, q)))
## p = 2 d = 1 q = 0
We fit the model with the above values. Observe that we apply it to
the original (untransformed, undifferenced) time series y, because the
Arima function includes arguments for both Lambda and
d.
Note: also remember that if you want a constant term you need to include it here.
arima.fit <- Arima(y,
order = c(p, d, q),
lambda = Lambda,
#lambda = 1,
include.constant = TRUE)
summary(arima.fit)
## Series: y
## ARIMA(2,1,0) with drift
## Box Cox transformation: lambda= 0.4240885
##
## Coefficients:
## ar1 ar2 drift
## 0.7919 -0.4986 -0.0019
## s.e. 0.0274 0.0274 0.0015
##
## sigma^2 = 0.001056: log likelihood = 2006.94
## AIC=-4005.87 AICc=-4005.83 BIC=-3986.24
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0002542038 0.07572928 0.06016474 -0.01057471 1.481481 0.7310768
## ACF1
## Training set 0.0009794669
If some coefficient is non-significant, you should reconsider the choice of model order.
coeftest(arima.fit)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.7919024 0.0273957 28.9061 <2e-16 ***
## ar2 -0.4985500 0.0273900 -18.2019 <2e-16 ***
## drift -0.0018654 0.0014532 -1.2836 0.1993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
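Since the drift term is not significant here (p ≈ 0.2), one option is to refit without the constant and compare information criteria. This is a sketch of one reasonable choice, not the only valid one; it reuses the y, p, d, q and Lambda objects defined above:

```r
# Refit without the constant term and compare AICc values;
# the model with the lower AICc is usually preferred.
arima.fit2 <- Arima(y,
                    order = c(p, d, q),
                    lambda = Lambda,
                    include.constant = FALSE)
c(with_drift = arima.fit$aicc, no_drift = arima.fit2$aicc)
```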
The following root plot is used to assess the stationarity and invertibility of the model.
autoplot(arima.fit)
Again, if the residuals are not white noise, you should reconsider the choice of the ARMA model order.
CheckResiduals.ICAI(arima.fit, bins = 100, lag = 25)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,0) with drift
## Q* = 16.226, df = 22, p-value = 0.8045
##
## Model df: 3. Total lags used: 25
Check the p-value of the Ljung-Box test above. The null hypothesis for this test can be rephrased as: the residuals are white noise. A large p-value, as here, means there is no evidence of remaining autocorrelation in the residuals.
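If you do not have the MLTools package, the forecast package's own checkresiduals function produces equivalent diagnostics (an alternative, not the method used above):

```r
# Time plot, ACF and histogram of the residuals,
# plus the same Ljung-Box test.
checkresiduals(arima.fit, lag = 25)
```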
autoplot(y, series = "Real", size=2, alpha=0.5) +
forecast::autolayer(arima.fit$fitted, series = "Fitted")
We need to select the prediction horizon h:
y_est <- forecast(arima.fit, h=12)
The result is a forecast type object. It contains not
only the forecasted values, but also uncertainty intervals and extra
information about the fit. Printing the forecast object shows some of
this information:
y_est
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1001 2.632242 2.560116 2.705524 2.522401 2.744788
## 1002 2.622788 2.476332 2.774112 2.400759 2.856202
## 1003 2.613486 2.414535 2.821562 2.312873 2.935441
## 1004 2.608535 2.379704 2.849548 2.263442 2.982120
## 1005 2.606956 2.360258 2.867880 2.235351 3.011836
## 1006 2.605882 2.344692 2.883082 2.212820 3.036388
## 1007 2.603531 2.327063 2.898021 2.187900 3.061310
## 1008 2.599919 2.306830 2.913373 2.159796 3.087671
## 1009 2.595947 2.286319 2.928432 2.131516 3.113835
## 1010 2.592320 2.267474 2.942455 2.105576 3.138211
## 1011 2.589145 2.250559 2.955337 2.082299 3.160553
## 1012 2.586160 2.234825 2.967350 2.060701 3.181440
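The components of the forecast object can also be accessed directly (a minimal sketch of the standard forecast-object fields):

```r
# Point forecasts and the 95% prediction interval bounds:
head(y_est$mean)            # point forecasts
head(y_est$lower[, "95%"])  # lower 95% bound
head(y_est$upper[, "95%"])  # upper 95% bound
```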
The forecast object can also be plotted.
autoplot(y_est)
In the following plot we subset the original time series to improve the visualization of the h forecasted values.
autoplot(subset(y, start = 900)) +
autolayer(y_est)
Let us go over the same dataset again, but this time we will use functions from other libraries to perform a basic train/test split and compute some performance measures.
n <- length(y)
n_test <- floor(0.2 * n)
library(TSstudio)
y_split <- ts_split(y, sample.out = n_test)
y_TS <- y_split$test
y_TR <- y_split$train
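The same split can be obtained with the base window function, without extra packages (an equivalent sketch, since y uses the natural numbers 1..n as its time index):

```r
# Training set: first n - n_test observations; test set: the rest.
y_TR2 <- window(y, end = n - n_test)
y_TS2 <- window(y, start = n - n_test + 1)
```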
Now we will briefly go over the same model identification and fitting steps, but using only the training values.

### Box-Cox
Lambda <- BoxCox.lambda.plot(y_TR, window.width = 10)
## `geom_smooth()` using formula = 'y ~ x'
useLambda <- TRUE
if(useLambda) {
z <- BoxCox(y_TR, Lambda)
} else {
z <- y_TR
Lambda <- 1
}
When the ACF decreases very slowly the time series needs differentiation. We analyze the ACF of the possibly Box-Cox transformed time series.
ggtsdisplay(z, lag.max = 25)
ADF test for stationarity
adf.test(z, alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: z
## Dickey-Fuller = -3.2778, Lag order = 9, p-value = 0.07451
## alternative hypothesis: stationary
Also check ndiffs
ndiffs(z)
## [1] 1
Choose the value of d:
d <- 1
If differencing is needed we apply it here:
Bz <- if(d > 0){
diff(z, differences = d) # apply the d differences chosen above
} else {
z
}
ACF and PACF of Bz
ggtsdisplay(Bz, lag.max = 25)
Now we propose a structure (p, d, q), with d already selected above:
p <- 2
q <- 1
Fit the model
arima.fit <- Arima(y_TR,
order=c(p, d, q),
lambda = Lambda,
include.constant = TRUE)
Summary of training errors and estimated coefficients
summary(arima.fit)
## Series: y_TR
## ARIMA(2,1,1) with drift
## Box Cox transformation: lambda= 0.5378993
##
## Coefficients:
## ar1 ar2 ma1 drift
## 0.8550 -0.5517 -0.0455 -0.0028
## s.e. 0.0553 0.0382 0.0665 0.0019
##
## sigma^2 = 0.001495: log likelihood = 1466.79
## AIC=-2923.58 AICc=-2923.5 BIC=-2900.16
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 4.912899e-05 0.08057908 0.06449987 -0.0134504 1.367553 0.7126226
## ACF1
## Training set -0.002289066
Statistical significance of estimated coefficients
coeftest(arima.fit)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.8550364 0.0552765 15.4683 <2e-16 ***
## ar2 -0.5516655 0.0381541 -14.4589 <2e-16 ***
## ma1 -0.0455003 0.0664709 -0.6845 0.4937
## drift -0.0028116 0.0018708 -1.5029 0.1329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Root plot
autoplot(arima.fit)
Check residuals
CheckResiduals.ICAI(arima.fit, bins = 100, lag = 25)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1) with drift
## Q* = 16.845, df = 21, p-value = 0.7205
##
## Model df: 4. Total lags used: 25
autoplot(y_TR, series = "Training", color = "blue", size=2, alpha=0.7) +
autolayer(arima.fit$fitted, series = "Fitted")
# Perform a future forecast
y_frcst <- forecast(arima.fit, h=n_test)
Plot of the final part of the training set and the forecast
autoplot(subset(y_TR, start = 550), series = "Training", color = "blue", size=2, alpha=0.7) +
autolayer(subset(arima.fit$fitted, start = 550), series = "Fitted") +
autolayer(y_frcst, series ="Forecast") +
autolayer(y_TS, series = "Test")
Quoting Hyndman's fpp3: "The basic concept is that we forecast the time series of interest y assuming that it has a linear relationship with other time series x." The time series x may even be one of the components of y, e.g. the trend. We will explore this idea below.
We fit a linear model with the tslm function. You can
read about tslm in Chapter 5 of
FPP2 (Hyndman’s book, 2nd ed.). In the third edition it has been
replaced with TSLM.
In this case we use the trend keyword to indicate that
we use the trend as input variable.
tslm_model <- tslm(y_TR ~ trend)
Check the significance of the coefficients
summary(tslm_model)
##
## Call:
## tslm(formula = y_TR ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.51870 -0.32676 0.04508 0.40835 1.12794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.360e+00 3.646e-02 201.9 <2e-16 ***
## trend -5.939e-03 7.887e-05 -75.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5152 on 798 degrees of freedom
## Multiple R-squared: 0.8766, Adjusted R-squared: 0.8765
## F-statistic: 5670 on 1 and 798 DF, p-value: < 2.2e-16
We can now use this linear model to do a forecast:
y_tslm <- forecast(tslm_model, h=n_test)
# And explore the forecast graphically
autoplot(subset(y_TR, start = 550), series = "Training", color = "blue") +
autolayer(y_tslm, series ="TSLM") +
autolayer(y_frcst, series ="Forecast", alpha = 0.25) +
autolayer(y_TS, series = "Test")
Zooming in on the beginning of the forecast:
autoplot(subset(y_TR, start = 790), series = "Training", color = "blue") +
autolayer(window(y_frcst$mean, end=820), series="Arima") +
autolayer(window(y_tslm$mean, end=820), series="tslm") +
ylab(" ")
Now we have trained two models, and we can compare them using their performance on the test set.
forecast::accuracy(y_frcst, y_TS)
## ME RMSE MAE MPE MAPE MASE
## Training set 4.912899e-05 0.08057908 0.06449987 -0.0134504 1.367553 0.7126226
## Test set 5.753839e-02 0.36826130 0.28350236 0.6245963 12.864040 3.1322577
## ACF1 Theil's U
## Training set -0.002289066 NA
## Test set 0.974213635 6.126543
forecast::accuracy(y_tslm, y_TS)
## ME RMSE MAE MPE MAPE MASE
## Training set -2.338407e-17 0.5145344 0.4199386 -1.354819 9.253299 4.639665
## Test set 2.211409e-01 0.4830432 0.3582143 8.209833 15.704776 3.957708
## ACF1 Theil's U
## Training set 0.9759082 NA
## Test set 0.9760039 7.475992
Some explicit formulas:
# mean(abs(residuals(tslm_model)))
# sqrt(mean(residuals(tslm_model)^2))
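For example, the test-set RMSE and MAE reported by accuracy can be recomputed by hand from the forecast errors (a sketch of the formulas above, reusing y_TS and y_frcst):

```r
# Test-set errors of the ARIMA forecast, computed explicitly;
# these should match the "Test set" row of the accuracy() table.
err <- y_TS - y_frcst$mean
sqrt(mean(err^2))  # RMSE
mean(abs(err))     # MAE
```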
As you can see, the test error of the linear model is worse than that of the ARIMA model. An exploration of the residuals of the linear model shows that they are clearly not white noise.
ggtsdisplay(residuals(tslm_model))
That means that the linear model is not able to extract as much information from the time series as Arima does.